Multi-byte characters

This chapter describes how the C language handles non-English characters.

Introduction to Unicode

The C language was created with only English characters in mind, and uses a 7-bit ASCII code for all characters.

However, when dealing with non-English characters, one byte is not enough; Chinese alone has at least tens of thousands of characters and the character set is bound to use multiple bytes for representation.

Initially, different countries devised their own character encodings, which made it difficult to mix text from several languages in one document. As a result, there was a gradual move towards Unicode, which puts all characters into a single character set.

Unicode assigns each character a number, called a code point; code points 0 to 127 coincide with ASCII. A code point is usually written as "U+" followed by its hexadecimal value, e.g. U+0041 for the letter A.

Unicode currently has room for more than one million code points, ranging from U+0000 to U+10FFFF, although only a fraction of them have been assigned to characters so far. Representing any code point at a fixed width therefore requires at least three bytes (21 bits). However, not all documents need that many characters; for example, an English document for which ASCII is sufficient would be three times larger if each character took three bytes instead of one.

To accommodate different usage requirements, the Unicode Standard defines three encodings of code points:

UTF-8: uses between one and four bytes to represent a code point. Different characters occupy different numbers of bytes.
UTF-16: For characters from U+0000 to U+FFFF (the Basic Multilingual Plane), 2 bytes are used to represent a code point. Other characters use 4 bytes.
UTF-32: 4 bytes are used uniformly to represent a code point.

Of these, UTF-8 is the most widely used because for ASCII characters (U+0000 to U+007F) it uses only one byte, which is exactly the same as the ASCII encoding.
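
The variable length of UTF-8 can be made concrete with a small encoder. The following is only a sketch: utf8_encode is a hypothetical helper (not a standard library function) that follows the usual UTF-8 bit layout and writes the bytes for one code point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
 * Writes 1 to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                      /* 0xxxxxxx (same as ASCII) */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)      /* surrogates are not characters */
        return 0;
    if (cp <= 0xFFFF) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx plus three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, U+0041 (the letter A) produces the single byte 0x41, while U+6625 (春) produces the three bytes 0xE6 0x98 0xA5.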

The C language provides two macros that indicate the maximum number of bytes in a multi-byte character supported by the system.

MB_LEN_MAX: the maximum number of bytes in a multi-byte character in any supported locale, defined in limits.h.
MB_CUR_MAX: the maximum number of bytes in a multi-byte character in the current locale, always less than or equal to MB_LEN_MAX, defined in stdlib.h.
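
As a quick way to inspect the two limits on a given system, a sketch such as the following may help. The function name print_mb_limits is made up for this illustration; the printed values depend on the platform and on the locale in effect.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdio.h>
#include <stdlib.h>   /* MB_CUR_MAX */

/* Illustrative helper: print both limits.
 * Returns 1 if MB_CUR_MAX does not exceed MB_LEN_MAX, otherwise 0. */
int print_mb_limits(void)
{
    printf("MB_LEN_MAX = %d\n", MB_LEN_MAX);
    /* MB_CUR_MAX has type size_t and may change after setlocale(). */
    printf("MB_CUR_MAX = %zu\n", (size_t)MB_CUR_MAX);
    return (size_t)MB_CUR_MAX <= (size_t)MB_LEN_MAX;
}
```

Calling print_mb_limits() before and after a call to setlocale() shows whether MB_CUR_MAX changes with the locale.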

Character representation

The essence of the character representation is that each character is mapped to an integer and then the character corresponding to that integer is obtained from the encoding table.

The C language provides different ways of writing integer numbers for representing characters.

\123: represents a character by its octal value, with one to three octal digits after the backslash.
\x4D: represents a character by its hexadecimal value, with \x followed by one or more hexadecimal digits.
\u2620: represents a character by its Unicode code point (not applicable to most ASCII characters, as explained in the next section); the code point is written in hexadecimal, and exactly four digits are required after \u.
\U0001243F: represents a character by its Unicode code point (same restriction as \u); the code point is written in hexadecimal, and exactly eight digits are required after \U.

printf("ABC\n");
printf("\101\102\103\n");
printf("\x41\x42\x43\n");

All three lines above will output "ABC".

printf("\u2022 Bullet 1\n");
printf("\U00002022 Bullet 1\n");

Both lines above will output "• Bullet 1".
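
The equivalence of the octal and hexadecimal spellings can also be checked in code. This is a minimal sketch that assumes an ASCII-compatible execution character set; escapes_match is a name invented for the illustration. It also shows one pitfall: a \x escape consumes every hexadecimal digit that follows it, so "\x41BC" would not mean "A" followed by "BC". Splitting the literal avoids this.

```c
#include <string.h>

/* Illustrative check: returns 1 if each escaped spelling yields
 * exactly the same bytes as the plain literal "ABC". */
int escapes_match(void)
{
    return strcmp("ABC", "\101\102\103") == 0    /* octal escapes */
        && strcmp("ABC", "\x41\x42\x43") == 0    /* hexadecimal escapes */
        && strcmp("ABC", "\x41" "BC") == 0;      /* split so \x41 stops at 41 */
}
```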

Representation of multi-byte characters

The C language presupposes that only basic characters can be represented literally, all other characters should be represented as code points, and the current system must also support the encoding method for that code point.

By basic characters, we mean all printable ASCII characters except for three: @, $, and `.

Therefore, when non-English characters are encountered, they should be written in Unicode code point form.

char* s = "\u6625\u5929";
printf("%s\n", s); // 春天

The above code will output the Chinese word "春天" (spring).

If the current system uses UTF-8 encoding, multi-byte characters can be written directly in literals.

char* s = "春天";
printf("%s\n", s);

Note that \u + code point and \U + code point cannot be used to represent characters with code points smaller than 0xA0 (which includes all ASCII characters), except for three characters: 0x24 ($), 0x40 (@) and 0x60 (`).

char* s = "\u0024\u0040\u0060";
printf("%s\n", s);  // $@`

The above code will output the three characters "$@`", but no other ASCII characters can be represented in this way.

To ensure that the characters are interpreted correctly when the program is executed, it is best to switch the program environment to a localised environment.

setlocale(LC_ALL, "");

In the above code, setlocale() is used to switch the execution environment to the localised language of the system. The prototype of setlocale() is defined in the header file locale.h, see the section locale.h in the standard library section for details.

It is also possible to name a specific locale, as in the following.

setlocale(LC_ALL, "zh_CN.UTF-8");

The above code switches the program execution environment to the Chinese (zh_CN) locale with UTF-8 encoding.
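
A named locale is only available if it is installed on the system, and setlocale() returns a null pointer when the request cannot be honoured. A small wrapper, sketched here under the hypothetical name switch_locale, makes that check explicit.

```c
#include <locale.h>
#include <stddef.h>

/* Illustrative wrapper: try to switch all locale categories to name.
 * Returns 1 on success, 0 if setlocale() rejected the request. */
int switch_locale(const char *name)
{
    return setlocale(LC_ALL, name) != NULL;
}
```

A program might first try switch_locale("zh_CN.UTF-8") and fall back to switch_locale("") if the first call returns 0.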

Since C11, the C language allows the use of the u8 prefix to specify UTF-8 encoding for a string literal.

char* s = u8"春天";
printf("%s\n", s);
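
Because a u8 literal is encoded as UTF-8 by the compiler, its size in bytes does not depend on the locale. The sketch below assumes a C11 or later compiler; u8_spring_bytes is a made-up name, and the code points are written with \u escapes so that the result does not depend on the encoding of the source file either.

```c
#include <stddef.h>

/* Illustrative helper: number of bytes in the UTF-8 encoding of
 * "春天" (U+6625 U+5929), not counting the terminating null byte. */
size_t u8_spring_bytes(void)
{
    /* sizeof counts the terminator as well, hence the "- 1". */
    return sizeof(u8"\u6625\u5929") - 1;
}
```

Each of the two characters takes three bytes in UTF-8, so the function returns 6.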

Once a string contains multi-byte characters, the number of bytes in the string no longer equals the number of characters. For example, a string 10 bytes long may contain only 7 characters, 5 characters, etc.

setlocale(LC_ALL, "");

char* s = "春天";
printf("%zu\n", strlen(s)); // 6

In the example above, the string s contains only two characters, but strlen() returns 6, which means that the two characters take up a total of 6 bytes.
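
When the string is known to be valid UTF-8, the number of characters (code points) can be obtained by counting only the bytes that start a character. The following sketch relies on that assumption and does no validation; utf8_count is a hypothetical helper, not a library function.

```c
#include <stddef.h>

/* Illustrative helper: count the code points in a valid UTF-8 string.
 * Continuation bytes have the bit pattern 10xxxxxx and are skipped,
 * so only the first byte of each character is counted. */
size_t utf8_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

For the six bytes of "春天" in UTF-8, utf8_count() returns 2 where strlen() returns 6.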

The C string and character functions assume single-byte characters and may give incorrect results for multi-byte characters, e.g. strtok(), strchr(), strspn(), toupper(), tolower(), isalpha(), etc.

Wide characters

In the multi-byte strings of the previous subsection, each character has a variable byte width. This encoding, although easy to use, is awkward for string processing, because the number of bytes occupied by each character must be checked one by one. Therefore, C also provides a way of storing characters at a fixed width, called wide characters.

A "wide character" means that each character takes up a fixed number of bytes, either 2 or 4 bytes. This makes it easy to process quickly.

Wide characters have a separate data type wchar_t, and every wide character is of this type. It is an alias for an integer type and may be signed or unsigned, as determined by the current implementation. The type is typically either 16 bits (2 bytes) or 32 bits (4 bytes) long, enough to hold all the characters of the current system. It is defined inside the header file wchar.h.

Literals with wide characters must be prefixed with L, otherwise C will treat them as narrow character types.

setlocale(LC_ALL, "");

wchar_t c = L'牛';
printf("%lc\n", c);

wchar_t* s = L"春天";
printf("%ls\n", s);

In the example above, the prefix "L" in front of the single quotes indicates a wide character, which corresponds to the %lc placeholder for printf(), and in front of the double quotes indicates a wide string, which corresponds to the %ls placeholder for printf().

Wide strings also have a null character at the end, but it is a wide null character that takes up multiple bytes.
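
The storage needed for a wide string, including its wide null terminator, can be computed with wcslen() from wchar.h. This is a small sketch; wide_storage_bytes is a name chosen for the illustration.

```c
#include <stddef.h>
#include <wchar.h>

/* Illustrative helper: bytes needed to store a wide string,
 * including the wide null character that terminates it. */
size_t wide_storage_bytes(const wchar_t *ws)
{
    /* wcslen() counts wide characters, not bytes, and excludes
     * the terminator, so add one and scale by sizeof(wchar_t). */
    return (wcslen(ws) + 1) * sizeof(wchar_t);
}
```

On a system where wchar_t is 4 bytes, a two-character wide string such as L"春天" therefore occupies 12 bytes.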

To handle wide characters, you need to use wide character-specific functions, most of which are defined in the header file wchar.h.

Multi-byte character handling functions

mblen()

The mblen() function returns the number of bytes occupied by a multibyte character. Its prototype is defined in the header file stdlib.h.

int mblen(const char* mbstr, size_t n);

It accepts two parameters. The first is a pointer to a multi-byte string; mblen() examines the first character of that string. The second is the maximum number of bytes to examine; normally MB_CUR_MAX is used, since no character in the current locale occupies more bytes than that.

Its return value is the number of bytes occupied by the character. If the character is the null character, 0 is returned; if the bytes do not form a valid multi-byte character, -1 is returned.

setlocale(LC_ALL, "");

char* mbs1 = "春天";
printf("%d\n", mblen(mbs1, MB_CUR_MAX)); // 3

char* mbs2 = "abc";
printf("%d\n", mblen(mbs2, MB_CUR_MAX)); // 1

In the above example, the first character "春" of the string "春天" takes up 3 bytes, and the first character "a" of the string "abc" takes up 1 byte.
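
Calling mblen() repeatedly gives a way to count the characters of a whole multibyte string in the current locale. The following is a sketch only; mb_count is a hypothetical helper, and its result for non-ASCII text depends on the locale selected with setlocale().

```c
#include <stdlib.h>

/* Illustrative helper: count the multibyte characters in s
 * according to the current locale.
 * Returns -1 if an invalid byte sequence is found. */
long mb_count(const char *s)
{
    long count = 0;

    (void)mblen(NULL, 0);              /* reset any shift state */
    while (*s != '\0') {
        int len = mblen(s, MB_CUR_MAX);
        if (len < 0)
            return -1;                 /* not a valid multibyte character */
        s += len;                      /* step over the whole character */
        count++;
    }
    return count;
}
```

In a UTF-8 locale, mb_count("春天") would be expected to return 2, whereas strlen() reports 6 for the same string.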

wctomb()

The wctomb() function (wide character to multibyte) is used to convert wide characters into multibyte characters. Its prototype is defined in the header file stdlib.h.

int wctomb(char* s, wchar_t wc);

wctomb() accepts two arguments, the first being a character array that receives the multibyte character, and the second being the wide character to be converted. Its return value is the number of bytes stored in the array, or -1 if the wide character cannot be converted.

setlocale(LC_ALL, "");

wchar_t wc = L'牛';
char mbStr[10] = "";

int nBytes = 0;
nBytes = wctomb(mbStr, wc);

printf("%s\n", mbStr);  // 牛
printf("%d\n", nBytes);  // 3

In the above example, wctomb() converts the wide character "牛" to a multibyte character, and the return value of wctomb() indicates that the converted multibyte character occupies 3 bytes.
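
wctomb() does not append a string terminator, so the example above relies on the array having been initialised to zeros. A sketch that sizes the buffer with MB_LEN_MAX and terminates the result explicitly might look like this; wc_to_mbs is a name invented for the illustration.

```c
#include <limits.h>   /* MB_LEN_MAX */
#include <stdlib.h>   /* wctomb, wchar_t */

/* Illustrative helper: convert one wide character to a
 * null-terminated multibyte string in the current locale.
 * out must have room for at least MB_LEN_MAX + 1 bytes.
 * Returns the number of bytes written before the terminator,
 * or -1 if wc cannot be represented. */
int wc_to_mbs(wchar_t wc, char *out)
{
    int n;

    (void)wctomb(NULL, 0);    /* reset any shift state */
    n = wctomb(out, wc);
    if (n < 0)
        return -1;
    out[n] = '\0';            /* wctomb() itself adds no terminator */
    return n;
}
```

A caller would declare the buffer as char buf[MB_LEN_MAX + 1] and check the return value before printing buf.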

mbtowc()

mbtowc() is used to convert a multibyte character to a wide character. Its prototype is defined in the header file stdlib.h.

int mbtowc(
   wchar_t* wchar,
   const char* mbchar,
   size_t count
);

It accepts 3 parameters, the first being a pointer to the wide character that is the target, the second being a pointer to the multibyte character to be converted, and the third being the maximum number of bytes to examine.

Its return value is the number of bytes of the multibyte character, 0 if the multibyte character is the null character, or -1 if the conversion fails.

setlocale(LC_ALL, "");

char* mbchar = "牛";
wchar_t wc;
wchar_t* pwc = &wc;

int nBytes = 0;
nBytes = mbtowc(pwc, mbchar, 3);

printf("%d\n", nBytes); // 3
printf("%lc\n", *pwc);  // 牛

In the above example, mbtowc() converts the multibyte character "牛" to the wide character wc, and the return value is the number of bytes occupied by mbchar (which occupies 3 bytes).
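
The example hard-codes 3 as the byte count, which is only right for a three-byte character. A more general sketch passes MB_CUR_MAX instead and hands the return value back to the caller for checking; first_wide_char is a hypothetical helper name.

```c
#include <stdlib.h>   /* mbtowc, MB_CUR_MAX, wchar_t */

/* Illustrative helper: decode the first multibyte character of s
 * into *out, using the current locale.
 * Returns its length in bytes, 0 if s starts with the null
 * character, or -1 if the bytes are not a valid character. */
int first_wide_char(const char *s, wchar_t *out)
{
    (void)mbtowc(NULL, NULL, 0);        /* reset any shift state */
    return mbtowc(out, s, MB_CUR_MAX);  /* examine at most one character */
}
```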

wcstombs()

wcstombs() is used to convert wide strings to multi-byte strings. Its prototype is defined in the header file stdlib.h.

size_t wcstombs(
   char* mbstr,
   const wchar_t* wcstr,
   size_t count
);

It accepts three parameters, the first parameter mbstr is a pointer to the target multibyte string, the second parameter wcstr is a pointer to the wide string to be converted, and the third parameter count is the maximum number of bytes to store the multibyte string.

If the conversion succeeds, its return value is the number of bytes written to the multibyte string, excluding the trailing string terminator; if the conversion fails, it returns (size_t)-1.

Here is an example.

setlocale(LC_ALL, "");

char mbs[20];
wchar_t* wcs = L"春天";

int nBytes = 0;
nBytes = wcstombs(mbs, wcs, 20);

printf("%s\n", mbs); // 春天
printf("%d\n", nBytes); // 6

In the above example, wcstombs() converts the wide string wcs to the multi-byte string mbs, returning a value of 6 indicating that the string written to mbs takes up 6 bytes, excluding the trailing string terminator.

If the first argument to wcstombs() is NULL, the number of bytes of the target string needed for the conversion to succeed is returned.
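
The NULL form makes a two-pass conversion possible: first ask for the required size, then allocate and convert. The sketch below assumes the behaviour described above, which POSIX specifies and common implementations provide; wcs_to_mbs_alloc is a name invented for the illustration, and the caller is responsible for calling free() on the result.

```c
#include <stdlib.h>   /* wcstombs, malloc, free, wchar_t */

/* Illustrative helper: convert a wide string to a newly allocated
 * multibyte string in the current locale.
 * Returns NULL if a character cannot be converted or if memory
 * allocation fails. */
char *wcs_to_mbs_alloc(const wchar_t *ws)
{
    size_t need = wcstombs(NULL, ws, 0);   /* bytes, without terminator */
    char *buf;

    if (need == (size_t)-1)
        return NULL;                       /* unconvertible character */
    buf = malloc(need + 1);
    if (buf == NULL)
        return NULL;
    wcstombs(buf, ws, need + 1);           /* room for the terminator too */
    return buf;
}
```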

mbstowcs()

mbstowcs() is used to convert a multibyte string to a wide string. Its prototype is defined in the header file stdlib.h.

size_t mbstowcs(
  wchar_t* wcstr,
  const char* mbstr,
  size_t count
);

It accepts three parameters, the first parameter wcstr is the target wide string, the second parameter mbstr is the multibyte string to be converted, and the third parameter count is the maximum number of wide characters to write to wcstr.

On a successful conversion, it returns the number of wide characters written, excluding the terminator; on a failed conversion, it returns (size_t)-1. If the return value is the same as the third argument, then the converted wide string is not null-terminated.

Here is an example.

setlocale(LC_ALL, "");

char* mbs = "天气不错";
wchar_t wcs[20];

size_t nChars = 0;
nChars = mbstowcs(wcs, mbs, 20);

printf("%ls\n", wcs); // 天气不错
printf("%zu\n", nChars); // 4

In the example above, the multibyte string mbs is converted to a wide string by mbstowcs(), which successfully converts 4 characters, so the function returns a value of 4.

If the first argument to mbstowcs() is NULL, then the number of characters the target wide string would contain is returned.
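
Since mbstowcs() omits the terminator when the target array fills up completely, it is worth checking the return value against the array size before using the result. A minimal sketch of such a check follows; mbs_to_wcs_checked and its return codes are assumptions made for this illustration.

```c
#include <stdlib.h>   /* mbstowcs, wchar_t */

/* Illustrative helper: convert src into dst, which has room for
 * cap wide characters.
 * Returns 0 on success (dst is null-terminated),
 *        -1 if src contains an invalid multibyte sequence,
 *        -2 if dst was too small to hold the terminator. */
int mbs_to_wcs_checked(wchar_t *dst, size_t cap, const char *src)
{
    size_t n = mbstowcs(dst, src, cap);

    if (n == (size_t)-1)
        return -1;
    if (n == cap)
        return -2;    /* array filled completely, no terminator written */
    return 0;
}
```

Only when the function returns 0 is it safe to pass dst to functions such as wcslen() or to print it with %ls.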
